feat: add persistent retry queue for failed telemetry events (#4940) #6454
Conversation
- Implement TelemetryQueue class with VSCode globalState persistence
- Add exponential backoff retry logic (1s to 5min)
- Queue failed events from PostHog and Cloud telemetry clients
- Implement queue size limits (1000 events max, ~1MB disk usage)
- Add graceful shutdown with 5-second timeout
- Ensure queue persists across extension restarts
- Add comprehensive test coverage (99 tests passing)
Pull Request Overview
This PR implements a persistent retry queue for failed telemetry events to prevent data loss during network outages or connectivity issues. The queue persists events across VSCode restarts and implements exponential backoff for retries.
Key changes:
- Added `TelemetryQueue` class with persistent storage and exponential backoff retry logic
- Modified telemetry clients to integrate with the queue for failed events
- Added comprehensive test coverage for queue functionality and client integration
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/extension.ts | Initializes telemetry queue on activation and flushes on deactivation |
| packages/telemetry/src/TelemetryQueue.ts | Core queue implementation with persistent storage and retry logic |
| packages/telemetry/src/TelemetryService.ts | Added queue initialization, management, and shutdown methods |
| packages/telemetry/src/BaseTelemetryClient.ts | Added abstract captureWithRetry method and queue integration |
| packages/telemetry/src/PostHogTelemetryClient.ts | Implemented retry logic and queue integration |
| packages/cloud/src/TelemetryClient.ts | Implemented retry logic and queue integration |
| packages/telemetry/src/index.ts | Exported TelemetryQueue class |
| Test files | Added comprehensive test coverage for queue and client integration |
    }

    // Attempt to send using the client's retry method
    return await client["captureWithRetry"](event.event)
Copilot AI commented on Jul 30, 2025
Using bracket notation to access a protected method is a code smell. Consider making captureWithRetry public or providing a proper public method for retry operations.
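For illustration, a minimal sketch of that refactor, assuming the names used in this PR (`BaseTelemetryClient`, `captureWithRetry`); the public wrapper `retryCapture` is hypothetical:

```typescript
// Hypothetical sketch: expose a public retry entry point so the queue's retry
// callback does not need client["captureWithRetry"].
interface TelemetryEvent {
	event: string
	properties?: Record<string, unknown>
}

abstract class BaseTelemetryClient {
	// Transport-specific retry logic stays an implementation detail.
	protected abstract captureWithRetry(event: TelemetryEvent): Promise<boolean>

	// Public wrapper used by the queue when retrying queued events.
	public async retryCapture(event: TelemetryEvent): Promise<boolean> {
		return this.captureWithRetry(event)
	}
}
```

The retry callback could then call `client.retryCapture(event.event)` instead of reaching through bracket notation.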
    if (event.clientType === "posthog" && c.constructor.name === "PostHogTelemetryClient") {
        return true
    }
    if (event.clientType === "cloud" && c.constructor.name === "TelemetryClient") {
Copilot AI commented on Jul 30, 2025
Using constructor.name for type checking is fragile and can break with minification. Consider using instanceof checks or adding a type property to the clients.
Suggested change:

    - if (event.clientType === "posthog" && c.constructor.name === "PostHogTelemetryClient") {
    -     return true
    - }
    - if (event.clientType === "cloud" && c.constructor.name === "TelemetryClient") {
    + if (event.clientType === "posthog" && c.type === "posthog") {
    +     return true
    + }
    + if (event.clientType === "cloud" && c.type === "cloud") {
    if (event.clientType === "posthog" && c.constructor.name === "PostHogTelemetryClient") {
        return true
    }
    if (event.clientType === "cloud" && c.constructor.name === "TelemetryClient") {
Copilot AI commented on Jul 30, 2025
Using constructor.name for type checking is fragile and can break with minification. Consider using instanceof checks or adding a type property to the clients.
Suggested change:

    - if (event.clientType === "posthog" && c.constructor.name === "PostHogTelemetryClient") {
    -     return true
    - }
    - if (event.clientType === "cloud" && c.constructor.name === "TelemetryClient") {
    + if (event.clientType === "posthog" && c instanceof PostHogTelemetryClient) {
    +     return true
    + }
    + if (event.clientType === "cloud" && c instanceof TelemetryClient) {
    // Distribute queue to all clients
    this.clients.forEach((client) => {
        if (client instanceof BaseTelemetryClient) {
            client.setQueue(this.queue!)
Copilot AI commented on Jul 30, 2025
Using the non-null assertion operator (!) can be dangerous. Consider checking if queue exists before using it or restructuring the code to ensure queue is always defined at this point.
Suggested change:

    - client.setQueue(this.queue!)
    + if (this.queue) {
    +     client.setQueue(this.queue);
    + } else {
    +     console.error("Queue is not initialized. Unable to set queue for client.");
    + }
    event.retryCount++
    event.lastAttempt = now
    // Calculate next attempt with exponential backoff
    const backoffMs = Math.min(1000 * Math.pow(2, event.retryCount), 300000) // Max 5 minutes
Consider replacing the literal values (1000 and 300000) in the exponential backoff calculation with the defined constants (TelemetryQueue.INITIAL_BACKOFF_MS and TelemetryQueue.MAX_BACKOFF_MS) to ensure consistency and easier maintenance.
    // Set up retry callback
    this.queue.setRetryCallback(async (event: QueuedTelemetryEvent) => {
        // Find the appropriate client for this event
        const client = this.clients.find((c) => {
Relying on c.constructor.name to determine client type is brittle. Consider using an explicit identifier (or instanceof checks) to distinguish between PostHog and Cloud telemetry clients.
    }

    // Attempt to send using the client's retry method
    return await client["captureWithRetry"](event.event)
Accessing the protected method captureWithRetry using bracket notation (client["captureWithRetry"]) can be fragile. Consider exposing a dedicated public retry method or refactoring to avoid bypassing access modifiers.
Thank you for implementing the persistent retry queue for telemetry events! This is a solid implementation that addresses the requirements from issue #4940. I've reviewed the changes and found several areas that could benefit from improvements to enhance code quality and prevent potential issues.
    // Start the timer if not already running
    if (!this.isRunning) {
        this.start()
Is this intentional? The start() method is called without checking if processing is already in progress. Could we add a guard to prevent multiple timers from running simultaneously? This could lead to race conditions where multiple executeCallback() calls overlap.
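As a rough sketch of such a guard (the `isProcessing` flag and the timer shape here are assumptions, not the PR's exact code), the processor could refuse to create a second timer and skip ticks while a callback is still in flight:

```typescript
// Hypothetical re-entrancy guard for a periodic queue processor.
class QueueProcessor {
	private isRunning = false
	private isProcessing = false
	private timer?: ReturnType<typeof setInterval>

	constructor(private readonly executeCallback: () => Promise<void>) {}

	public start(intervalMs = 30_000): void {
		if (this.isRunning) {
			return // Never create a second timer.
		}
		this.isRunning = true
		this.timer = setInterval(() => void this.tick(), intervalMs)
	}

	public stop(): void {
		if (this.timer) {
			clearInterval(this.timer)
			this.timer = undefined
		}
		this.isRunning = false
	}

	private async tick(): Promise<void> {
		if (this.isProcessing) {
			return // Skip this tick if the previous callback has not finished.
		}
		this.isProcessing = true
		try {
			await this.executeCallback()
		} finally {
			this.isProcessing = false
		}
	}
}
```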
    }

    // Attempt to send using the client's retry method
    return await client["captureWithRetry"](event.event)
Could we improve the type safety here? Direct access to the private method captureWithRetry using bracket notation bypasses TypeScript's protection. Consider making this method protected or adding a public retry interface to BaseTelemetryClient.
    event.retryCount++
    event.lastAttempt = now
    // Calculate next attempt with exponential backoff
    const backoffMs = Math.min(1000 * Math.pow(2, event.retryCount), 300000) // Max 5 minutes
Is this backoff calculation intentionally different from the one on line 241? We have two different exponential backoff formulas: this one uses event.retryCount directly, while the other uses attemptCount. Could we standardize these to avoid confusion?
    event.retryCount++
    event.lastAttempt = now
    // Calculate next attempt with exponential backoff
    const backoffMs = Math.min(1000 * Math.pow(2, event.retryCount), 300000) // Max 5 minutes
Could we extract this backoff calculation into a helper method? The same formula Math.min(1000 * Math.pow(2, event.retryCount), 300000) is repeated on line 141. This would improve maintainability and reduce the chance of inconsistencies.
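A short sketch of what the extracted helper could look like; the constant names are the ones the earlier comment says are already defined, while the method name `calculateBackoffMs` is an assumption:

```typescript
// Sketch: a single backoff helper shared by both retry paths.
class TelemetryQueue {
	private static readonly INITIAL_BACKOFF_MS = 1000 // 1 second
	private static readonly MAX_BACKOFF_MS = 300000 // 5 minutes

	/** Exponential backoff based on the attempt count, capped at MAX_BACKOFF_MS. */
	protected calculateBackoffMs(retryCount: number): number {
		return Math.min(
			TelemetryQueue.INITIAL_BACKOFF_MS * Math.pow(2, retryCount),
			TelemetryQueue.MAX_BACKOFF_MS,
		)
	}
}
```

Both call sites would then use `this.calculateBackoffMs(...)` with their respective counters instead of repeating the literal formula.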
     * @param event The telemetry event that failed to send
     * @param clientType The client type that should process this event
     */
    public async addEvent(event: TelemetryEvent, clientType: "posthog" | "cloud"): Promise<void> {
Could we add validation for the event parameter? There's no check that event is not null/undefined before processing, which could cause runtime errors if called incorrectly.
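A brief sketch of such a guard; the signature mirrors the snippet above, while the warn-and-return behavior is an assumption rather than the PR's choice:

```typescript
// Hypothetical validation at the top of addEvent; skipping invalid input is one option,
// throwing would be another.
type ClientType = "posthog" | "cloud"

interface TelemetryEvent {
	event: string
	properties?: Record<string, unknown>
}

async function addEvent(event: TelemetryEvent | undefined, clientType: ClientType): Promise<void> {
	if (!event || typeof event.event !== "string") {
		console.warn(`TelemetryQueue.addEvent called without a valid event (clientType=${clientType}); skipping`)
		return
	}
	// ...enqueue the event and persist it to globalState as the PR does...
}
```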
    this.queue = new TelemetryQueue(context)

    // Set up retry callback
    this.queue.setRetryCallback(async (event: QueuedTelemetryEvent) => {
Is this closure pattern intentional? The retry callback creates a closure over the entire service instance, which could prevent garbage collection. Could we consider using a WeakRef or extracting this to a separate method to reduce memory retention?
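One concrete form of the "separate method" suggestion, sketched with hypothetical names (`handleRetry`, `RetryQueue`); note that a named method still references the service instance, so the main gain is readability rather than memory:

```typescript
// Hypothetical sketch: retry handling as a named private method instead of an inline closure.
interface QueuedTelemetryEvent {
	event: { event: string; properties?: Record<string, unknown> }
	clientType: "posthog" | "cloud"
	retryCount: number
}

interface RetryQueue {
	setRetryCallback(cb: (event: QueuedTelemetryEvent) => Promise<boolean>): void
}

class TelemetryServiceSketch {
	private queue?: RetryQueue

	public initializeQueue(queue: RetryQueue): void {
		this.queue = queue
		this.queue.setRetryCallback((event) => this.handleRetry(event))
	}

	private async handleRetry(event: QueuedTelemetryEvent): Promise<boolean> {
		// Locate the matching client and retry, as the PR's inline callback does.
		return false // placeholder result in this sketch
	}
}
```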
    private static readonly MAX_QUEUE_SIZE = 1000
    private static readonly MAX_RETRY_COUNT = 10
    private static readonly SUCCESS_INTERVAL = 30000 // 30 seconds
    private static readonly INITIAL_BACKOFF_MS = 1000 // 1 second
Could we add documentation for these constants? Values like 1000 and 300000 would benefit from comments explaining they represent 1 second and 5 minutes respectively. Consider making them configurable if different retry strategies might be needed in the future.
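For example, something like the following; the values come from the PR, while the comment wording is only inferred and should be checked against the actual behavior:

```typescript
// Sketch: the same constants with units and intent documented.
class TelemetryQueueConstantsSketch {
	/** Maximum number of queued events (roughly ~1MB of globalState per the PR description). */
	private static readonly MAX_QUEUE_SIZE = 1000
	/** Maximum delivery attempts per event. */
	private static readonly MAX_RETRY_COUNT = 10
	/** 30 seconds. */
	private static readonly SUCCESS_INTERVAL = 30_000
	/** 1 second: initial retry delay, doubled on each attempt. */
	private static readonly INITIAL_BACKOFF_MS = 1_000
	/** 5 minutes: upper bound on the retry delay. */
	private static readonly MAX_BACKOFF_MS = 300_000
}
```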
    /**
     * Starts the queue processing timer
     */
    public start(): void {
Could we add JSDoc documentation for this public method? It would help other developers understand when and why to call start(), especially since it's called automatically in addEvent().
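Something along these lines could work; the wording and the `stop()` counterpart shown here are illustrative, not the PR's actual docs:

```typescript
// Sketch of the requested JSDoc on a minimal timer wrapper.
class TelemetryQueueDocSketch {
	private timer?: ReturnType<typeof setInterval>

	/**
	 * Starts the periodic queue-processing timer.
	 *
	 * addEvent() calls this automatically when an event is queued, so most callers
	 * never need to invoke it directly. Use stop() to halt processing, for example
	 * during graceful shutdown.
	 */
	public start(intervalMs = 30_000): void {
		this.timer ??= setInterval(() => {
			// process due events here
		}, intervalMs)
	}

	/** Stops the processing timer if it is running. */
	public stop(): void {
		if (this.timer) {
			clearInterval(this.timer)
			this.timer = undefined
		}
	}
}
```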
Summary
Implements a persistent retry queue for failed telemetry events to prevent data loss during network outages or connectivity issues. The queue persists events across VSCode restarts and implements exponential backoff for retries.
Fixes #4940
Changes
New Features
- `TelemetryQueue` class with persistent storage using VSCode's `globalState`
Modified Files
- Added `captureWithRetry` method
Tests
Testing
Automated Tests
All tests passing:
Manual Testing Checklist
- Test network failure recovery
- Test extension restart with queued events
- Test queue size limits
- Test graceful shutdown
Review Notes
Issues Fixed During Review
- `_events` property
Design Decisions
- Follows the pattern in `packages/cloud/src/RefreshTimer.ts` to avoid circular dependencies between packages. This could be refactored to a shared utility package in the future.
Success Criteria Met
✅ Events queued when network fails
✅ Retry with exponential backoff
✅ Queue persists across restarts
✅ Queue has size limits (1000 events)
✅ Graceful shutdown with timeout
✅ User visibility via getQueueStatus
✅ Minimal performance impact
✅ Comprehensive test coverage
✅ No breaking changes
✅ Proper error handling
Screenshots/Recordings
N/A - Backend telemetry feature
Related Issues
Checklist
Important
Introduces a persistent retry queue for telemetry events with exponential backoff, integrating with existing telemetry clients and ensuring data persistence across VSCode restarts.
- `TelemetryQueue` class for persistent retry of telemetry events using VSCode's `globalState`.
- Integrates with `PostHogTelemetryClient` and `CloudTelemetryClient`.
- `TelemetryQueue.ts`: Core queue implementation.
- `BaseTelemetryClient.ts`, `PostHogTelemetryClient.ts`, `TelemetryClient.ts`: Add queue integration and retry logic.
- `TelemetryService.ts`: Manages queue initialization and shutdown.
- `extension.ts`: Initializes queue on activation, flushes on deactivation.
- Tests in `TelemetryQueue.test.ts`, `PostHogTelemetryClient.test.ts`, `TelemetryService.test.ts`, and `TelemetryClient.test.ts`.
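As a rough illustration of the activation/deactivation wiring described above (the facade and method names such as `initializeQueue` and `shutdownQueue` are assumptions; only the overall shape comes from this PR's description):

```typescript
// Hypothetical sketch of the extension-side integration.
import * as vscode from "vscode"

// Assumed facade; the PR routes this through TelemetryService.
declare const telemetryService: {
	initializeQueue(context: vscode.ExtensionContext): Promise<void>
	shutdownQueue(timeoutMs: number): Promise<void>
}

export async function activate(context: vscode.ExtensionContext): Promise<void> {
	// Restore any events persisted in globalState and begin retrying them.
	await telemetryService.initializeQueue(context)
}

export async function deactivate(): Promise<void> {
	// Flush what we can, but never block shutdown for more than five seconds.
	await telemetryService.shutdownQueue(5_000)
}
```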